Red Wine Data Exploration by Zackaria Mufti

Intro: This report explores a dataset consisting of the chemical properties and quality ratings of 1599 red wines. I will use different visualizations of the dataset to examine the relationships between the chemical properties of the wines at univariate, bivariate, and multivariate levels.

Univariate Plots Section

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

I see that most wines in this data set are rated at 5 or 6 out of 10. I will divide wine quality into 3 quality rating classes, low (3-4), average (5-6), and high (7-8) and try a similar visualization.
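The grouping described above could be done with base R's `cut()`. This is only a minimal sketch, not necessarily the code used for this report: the `quality` vector holds just the first ten ratings shown in the `str()` output, and the break points are an assumption chosen to match the three ranges.

```r
# First ten quality ratings, copied from the str() output above
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Bin ratings into ordered classes (break points are an assumption):
# (2,4] -> low, (4,6] -> average, (6,8] -> high
quality_class <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("low", "average", "high"),
                     ordered_result = TRUE)

# Count the wines in each class
table(quality_class)
```

Even in this ten-wine sample, most ratings fall in the average class.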

Most wines in the dataset appear to be of average quality. Looking at other univariate plots, I might see some similarities in other variables.

Alcohol content in wines appears to be right skewed. Most wines in this data set seem to have average to lower levels of alcohol content.

Total sulfur dioxide levels in wines are heavily right-skewed. Most wines appear to have low levels of sulfur dioxide.

pH seems to be relatively symmetric, and pH does not seem to vary too much.

Fixed acidity seems to have a slight right skew; nothing too interesting here.

Volatile acidity may be bimodal around the mean. Maybe there is a range of volatile acidity levels that gives wine an average quality.

Residual sugar seems to be low in most wines, but there are some high levels of residual sugar which are far from the mean and median levels.

Chlorides seem to be low in most wines and are mostly symmetric in distribution, but there are some high outliers which are far from the mean and median levels.

Density of wines appears to be more or less normally distributed.

Sulphates appear to be slightly right-skewed. Maybe average and lower levels of sulphates could relate to average and low alcohol levels.

Univariate Analysis


What is the structure of your dataset?

This data set consists of 1599 rows and 13 columns (14 once the quality_class variable described below is added). Most wines in the dataset seem to be of average quality, so many of the connections drawn between the chemical properties and wine quality may be based on what makes a wine average.

What is/are the main feature(s) of interest in your dataset?

The most interesting thing with this dataset is to figure out what makes a wine have good or bad quality, or average quality. Is it a specific chemical property that makes wine objectively good? Or is wine tasting more a subjective topic that shouldn’t be based on chemical characteristics?

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

It would be interesting to see how the chemical property variables affect wine quality within this data set. I am specifically interested to see how alcohol content affects the perception of wine quality.

Did you create any new variables from existing variables in the dataset?

I created a new variable quality_class, in order to group wine rating scores into groups of quality (low, average, and high).

Of the features you investigated, were there any unusual distributions?

Many chemical characteristics seemed to have right skews in their distribution (alcohol, total sulfur dioxide, fixed acidity, volatile acidity, and sulphates). Density and pH, on the other hand, appeared more normally distributed. Chlorides have a few outliers on the right, and so does residual sugar.
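One rough way to cross-check these impressions without the plots is to compare each variable's mean with its median, since a mean well above the median hints at a right skew. The sketch below uses only the values printed in the `summary()` output. The `rel_gap` column is a name introduced here for illustration, and the heuristic is no substitute for looking at the histograms.

```r
# Mean and median values copied from the summary() output above
stats <- data.frame(
  variable = c("alcohol", "total.sulfur.dioxide", "fixed.acidity",
               "volatile.acidity", "sulphates", "residual.sugar",
               "chlorides", "density", "pH"),
  mean   = c(10.42, 46.47, 8.32, 0.5278, 0.6581, 2.539, 0.08747, 0.9967, 3.311),
  median = c(10.20, 38.00, 7.90, 0.5200, 0.6200, 2.200, 0.07900, 0.9968, 3.310)
)

# Relative gap between mean and median; a positive gap hints at a right skew
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Sort from the largest gap to the smallest
stats[order(-stats$rel_gap), c("variable", "rel_gap")]
```

Total sulfur dioxide and residual sugar show the largest gaps, while density and pH sit very close to zero, which agrees with the descriptions above.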

Bivariate Plots Section


This correlation visualization shows some interesting correlations: those involving the variable of interest (quality), and some stronger correlations between other variables.

  1. Quality and other variables:
     1. Alcohol and quality have a moderate positive correlation.
     2. Quality and sulphates have a weaker positive correlation.
     3. Volatile acidity and quality have a moderate negative correlation.
  2. Some interesting stronger correlations:
     1. Fixed acidity and pH have a strong negative correlation.
     2. Density and fixed acidity have a strong positive correlation.

We can look closer at these correlations in scatter plots.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

Due to some variation in the data, the mean is not very consistent (dropping at 9%, 10%, and 12% alcohol). However, a positive trend is clearly visible and there is evidence of a positive correlation between alcohol and quality with a correlation coefficient of 0.476.
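The test statistic and confidence interval in the output above can be recovered from the correlation coefficient and the sample size alone. The sketch below uses the standard formulas (a t statistic with n - 2 degrees of freedom and a confidence interval based on the Fisher z-transformation). The variable names are introduced here for illustration.

```r
# Values copied from the cor.test() output above
r  <- 0.4761663   # sample correlation between alcohol and quality
n  <- 1599        # number of wines
df <- n - 2       # degrees of freedom

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

Up to rounding, this should reproduce the reported t value of 21.639 and the interval from about 0.437 to 0.513. With 1599 wines, even a moderate correlation gives a very large t statistic, which is why the p-value is so small.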

## 
##  Pearson's product-moment correlation
## 
## data:  wine$volatile.acidity and wine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The mean line in this scatter plot has a clear downward trend where the data is most dense. For higher levels of volatile acidity, there are fewer data points, yet a moderate negative correlation is still present (-0.39).

## 
##  Pearson's product-moment correlation
## 
## data:  wine$sulphates and wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

Sulphates and quality seem to have a weaker correlation, but still statistically significant, with a correlation coefficient of 0.25. Maybe the combination of higher levels of sulphates and higher levels of alcohol will have a stronger correlation with higher wine quality in a multivariate analysis.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

The negative correlation between fixed acidity and pH is moderately strong (-0.683). This makes sense intuitively because low levels of pH correspond to high levels of acidity.
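The direction of this relationship follows from the definition of pH as the negative base-10 logarithm of the hydrogen ion concentration. The concentrations in the sketch below are hypothetical values chosen for illustration, not measurements from the dataset. Fixed acidity is also not itself a hydrogen ion concentration, so this only illustrates the direction of the effect.

```r
# Hypothetical hydrogen ion concentrations in mol/L (not from the dataset)
h_conc <- c(1e-4, 1e-3)

# pH is the negative base-10 logarithm of the hydrogen ion concentration
ph <- -log10(h_conc)

# A tenfold increase in concentration lowers pH by one unit
ph
diff(ph)
```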

## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

Density and fixed acidity show another strong, statistically significant positive correlation (0.668). Wines with higher fixed acidity tend to have higher density.
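To compare the five relationships tested above on an equal footing, the coefficients can be ranked by absolute value so that the sign does not matter. The sketch below uses only the numbers printed in the `cor.test()` outputs. The labels are introduced here for readability.

```r
# Correlation coefficients copied from the cor.test() outputs above
cors <- c(
  "alcohol ~ quality"          =  0.4761663,
  "volatile.acidity ~ quality" = -0.3905578,
  "sulphates ~ quality"        =  0.2513971,
  "fixed.acidity ~ pH"         = -0.6829782,
  "fixed.acidity ~ density"    =  0.6680473
)

# Order by absolute value so that sign does not affect the ranking
ranked <- cors[order(abs(cors), decreasing = TRUE)]
round(ranked, 3)
```

Ranked this way, fixed acidity and pH come first, with fixed acidity and density close behind. Alcohol is the strongest of the three correlations involving quality.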

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

It was interesting to see that the strongest correlation involving the variable of interest, quality, was with alcohol. Does alcohol affect the taste of a wine enough to make a wine expert rate a wine higher on the scale? Or could the level of alcohol in the wine influence the perception of that wine? If alcohol were the deciding factor in wine quality, why wouldn’t wine producers just increase the level of alcohol in their cheaper wines and sell them for a much higher price?

Also interesting to note was that sulphates also had a positive correlation with wine quality. Maybe alcohol and sulphates together can heavily impact the quality of a wine. Volatile acidity seems to be moderately negatively correlated with wine quality and may have interactions with alcohol as well. I will investigate these findings in multivariate analysis.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

It was interesting to see evidence in the wine data of the known relationship between pH and acidity (low levels of pH mean high acidity), where fixed acidity and pH were strongly negatively correlated.

What was the strongest relationship you found?

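The strongest positive relationship was between fixed acidity and density (0.668). It seems the more acidic a wine is, the higher the density of the wine. The negative correlation between fixed acidity and pH (-0.683) was slightly stronger in magnitude.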

Multivariate Plots Section

I will investigate further how alcohol and some other variables interact in affecting the quality rating of wines.

Looking at this plot, it seems that wine quality is lower at the bottom left, where sulphates and alcohol are both low. Wine quality seems to be higher closer to the upper right of the data, where both alcohol and sulphate levels are higher. Alcohol seems to have more of an effect on quality, but alcohol and sulphate levels together do seem to have a positive impact on wine quality.

Looking at sulphates and alcohol again above, I have split quality rating into different scatter plots to see more clearly. With this visualization, an overall low level to higher level trend can be seen in each of the variables, sulphates, alcohol, and quality.

Looking at the above plot of volatile acidity and alcohol in relation to quality, it seems that the lighter colors (low to average quality levels) are mostly in the upper left, while the darker blues (higher quality levels) are mostly in the bottom right. I will split this plot up again to better see the relationship.

Looking above, there is a clear trend: as volatile acidity decreases and alcohol increases, wine quality increases. I will put these variables in a linear model to generate coefficient values for the relationship with quality.

## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity, 
##     data = wine)
## 
## Coefficients:
##      (Intercept)           alcohol         sulphates  volatile.acidity  
##           2.6108            0.3092            0.6790           -1.2214
## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity, 
##     data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7186 -0.3820 -0.0641  0.4746  2.1807 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.61083    0.19569  13.342  < 2e-16 ***
## alcohol           0.30922    0.01580  19.566  < 2e-16 ***
## sulphates         0.67903    0.10080   6.737 2.26e-11 ***
## volatile.acidity -1.22140    0.09701 -12.591  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6587 on 1595 degrees of freedom
## Multiple R-squared:  0.3359, Adjusted R-squared:  0.3346 
## F-statistic: 268.9 on 3 and 1595 DF,  p-value: < 2.2e-16

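Looking at the model results, predicted wine quality increases by about 0.31 units for every unit increase in alcohol and by about 0.68 units for every unit increase in sulphates, holding the other variables constant. Predicted quality decreases by about 1.22 units for every unit increase in volatile acidity. Together, the three variables explain about 34% of the variance in quality (R-squared of 0.336).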
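To make the coefficients concrete, the fitted equation can be applied by hand to a single wine. The sketch below hard-codes the estimates from the model summary. `predict_quality` is a hypothetical helper written for this illustration, not a function from any package, and the input values are those of the first wine in the `str()` output.

```r
# Coefficients copied from the summary(lm(...)) output above
coefs <- c(intercept = 2.61083, alcohol = 0.30922,
           sulphates = 0.67903, volatile.acidity = -1.22140)

# Hypothetical helper: predicted quality from the fitted linear model
predict_quality <- function(alcohol, sulphates, volatile_acidity) {
  unname(coefs["intercept"] +
           coefs["alcohol"] * alcohol +
           coefs["sulphates"] * sulphates +
           coefs["volatile.acidity"] * volatile_acidity)
}

# First wine in the dataset (values from the str() output); its rating is 5
first_wine <- predict_quality(alcohol = 9.4, sulphates = 0.56,
                              volatile_acidity = 0.7)

# Same wine with one more unit of alcohol, other inputs held fixed
more_alcohol <- predict_quality(alcohol = 10.4, sulphates = 0.56,
                                volatile_acidity = 0.7)

first_wine
more_alcohol - first_wine
```

The prediction for the first wine comes out close to its actual rating of 5, and the difference between the two predictions equals the alcohol coefficient, which is what "per unit increase, other variables held constant" means in practice.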

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

It seems that alcohol and sulphates combined have a significant positive impact on wine quality. That is, with more alcohol and sulphates, wine quality seems to go up in rating. Also, when alcohol and volatile acidity are considered together, wine quality seems to increase as volatile acidity decreases and alcohol increases.

Were there any interesting or surprising interactions between features?

It is interesting that volatile acidity does not help the quality of wine. I would think from experience that a lack of acid would make a wine taste flat and reduce its quality. However, from further research (http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity), different acids affect the taste of a wine differently. The acid that is involved in volatile acidity, acetic acid, comes from bacteria and is more prevalent in dessert wines than in common wines, and volatile acidity levels are generally kept low otherwise through processes like reverse osmosis. If we were to assume that most of the wines in this dataset are not dessert wines, then it makes sense that lower quality (and likely cheaper) wines would have higher levels of volatile acidity, because it costs more money to remove acetic acid than to leave it in the wine and sell at a lower price. It is hard to conclude anything because the types of wines are not known; however, it is interesting to think about the interaction between volatile acidity and wine quality.


Final Plots and Summary

Plot One

Description One

This plot may seem simple, but it is one of the most important plots of this dataset. It is clear that a dominant majority of the wines in this dataset are of average quality (ratings 5 and 6) and that there are few low- and high-quality wines. There are multiple things to consider in the analysis of the data given this information:

–Could it be that the scale of quality rating doesn’t have a large enough range to accurately rate wine quality?

–If the wines are rated accurately, can we still draw conclusions about what makes a good wine if most of the data consists of average quality wine?

–Is the data set large enough? Or is the very high occurrence of average wines in this sample proportionally consistent with the occurrence of average wines in the total population of wines?

–We may have to assume that the dataset is accurate and complete enough for this analysis and only make tentative inferences from it about red wines in general.

Plot Two

Description Two

This plot is important in revealing the variable most correlated with red wine quality in the dataset, alcohol. It is very interesting to see that other chemical characteristics weren’t more related to quality than alcohol. Looking at the plot, wine quality seems to increase with alcohol percentage. There are multiple things to consider given this information:

–Does alcohol affect the taste of a wine enough to cause quality rating to go up?

–Could it be that alcohol percentage is influencing the perception of the wines in the tasting process by the wine experts?

–Or is it the process that produces higher levels of alcohol in wine that makes it taste better? That is, it could be the combination of alcohol and other chemical properties produced in the process that makes a wine better quality.

Plot Three

Description Three

This plot looks at the effect of alcohol on wine quality at a deeper level. Here, we can see how sulphates and alcohol are correlated with wine quality at the same time. From the plot, there is a clear movement from lower levels of sulphate concentration and alcohol percentage to higher levels as wine quality goes up (yellow to red). There are multiple things to consider with this information in mind:

–Higher levels of sulphate concentration and alcohol percentage seem to indicate better quality red wines.

–Proportionally, it looks like alcohol percentage is slightly more dominant in its influence on wine quality than sulphate concentration.

–Perhaps the taste of alcohol and sulphates interact well together and make a wine taste better?

–Or perhaps sulphates are produced as alcohol increases during the wine production process?


Reflection

The main struggle in this analysis for me was dealing with the thought that the data would be inadequate for making reasonable conclusions. This came mostly from finding that most of the wines in the dataset were of average quality. However, as the analysis continued, there were strong enough correlations with higher levels of wine quality to produce interesting results. It was very interesting and very surprising to see how alcohol, and later, alcohol and sulphates, could influence the perceived quality of a wine. Although these conclusions were interesting and were significant in the dataset, it might be better, for making inferences about the total population of red wines, to obtain a larger dataset with more variation between low, average, and high quality wines. Also, more investigation could be done in the future by researching the wine production process in depth and learning how sulphates and alcohol are involved: whether they are added to wine purposefully for taste reasons, or whether they chemically arise together during the production process. Then, one could investigate further the relationship between sulphate concentration, alcohol percentage, and red wine quality.